-
Notifications
You must be signed in to change notification settings - Fork 912
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detect mismatches in begin and end tokens returned by JSON tokenizer FST #17471
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
/ok to test |
cpp/src/io/json/json_tree.cu
Outdated
cuda::proclaim_return_type<std::uint8_t>([] __device__(auto token) -> std::uint8_t { | ||
std::uint8_t token_bits{0}; | ||
switch (token) { | ||
case token_t::StructBegin: [[fallthrough]]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for [[fallthrough]]
when the case is empty. I misunderstood that guideline at first so we have a lot of unnecessary [[fallthrough]]
s in the code 😬
cpp/src/io/json/json_tree.cu
Outdated
CUDF_EXPECTS(h_tokens[0] == token_t::ValueBegin, | ||
"Some begin token does not have a matching end token"); | ||
} else { | ||
auto not_ok = thrust::transform_reduce( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this work when we have two (same) begin tokens without any end tokens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic has been replaced a check on the logical stack - if the logical stack is non-empty after the bracket-brace FST then the last record is incomplete.
When the last JSONL record in the input is incomplete, and error recovery is set to FAIL, then an error token is not emitted after the entire input is read, which gives rise to above bug. Even though a newline character is inserted after the last record, error tokens are emitted on line break only when |
/merge |
Description
Addresses #15820
Checklist